The Normalized Compression Distance as a Distance Measure in Entity Identification
Authors
Abstract
The identification of identical entities across heterogeneous data sources still involves a large amount of manual processing. This is mainly because different sources use different data representations in varying semantic contexts. Until now, entity identification has required either the (often manual) unification of different representations or the programming of tools with specialized interfaces for each representation type. However, for large and sparse databases, which are common in medical data, for example, the manual approach becomes infeasible. We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
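For orientation, the normalized compression distance named in the title is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length under some real-world compressor. The following is a minimal sketch of that standard formula only, not the authors' implementation: the choice of zlib as compressor, the function names, and the sample records are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input; stands in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical records from two differently structured sources (illustrative only).
record_a = b"name=Jane Doe; born=1970-01-01; city=Berlin"
record_b = b'{"city": "Berlin", "dob": "1970-01-01", "full_name": "Doe, Jane"}'
record_c = b"name=Rahul Mehta; born=1988-11-23; city=Pune"

print("a vs b:", ncd(record_a, record_b))
print("a vs c:", ncd(record_a, record_c))
```

Lower values indicate more shared structure between the two inputs. On very short strings the fixed overhead of the compressor dominates, so the values are only a rough signal; the compressor and any matching threshold would need to be chosen for the data at hand.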
Similar resources
Effect of Image Linearization on Normalized Compression Distance
Normalized Information Distance, based on Kolmogorov complexity, is an emerging metric for image similarity. It is approximated by the Normalized Compression Distance (NCD) which generates the relative distance between two strings by using standard compression algorithms to compare linear strings of information. This relative distance quantifies the degree of similarity between the two objects....
Normalized Information Distance is Not Semicomputable
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...
Nonapproximability of the Normalized Information Distance
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...
Cover Song Identification Based on Data Compression
We present a system for cover song identification. Our approach combines chord sequence estimation with a similarity metric called normalized compression distance.
Normalized Distance Matrix Method for Construction of Phylogenetic Trees Using New Compressor - Dnabit Compress
We define a compression distance, based on a normal compressor to show it is an admissible distance. The first theme concerns the statistical significance of compressed file sizes. Only in recent years have scientists begun to appreciate the fact that compression ratios signify a great deal of important statistical information. In applying the approach, we have used a new DNA sequence compresso...
Publication date: 2009